Note to the Reader
To keep this explainer notebook as clean and readable as possible, several longer Python functions have been moved to separate files located in the /utilities folder.
Our full project, including code and website source code, is publicly available on GitHub: https://github.com/Andreas-Holm-2/02467-Project-assignment
Our website is online at https://andreas-holm-2.github.io/02467-Project-assignment/
The datasets used in this project can be accessed here:
Spotify Artist Collaboration Network (Large Dataset):
US Top 10K Artists and Their Popular Songs (Smaller Dataset):
Motivation
As music plays a significant role in most people’s lives—including our own—it was naturally intriguing for us to explore this domain. The goal of this project is to investigate whether there are any significant differences in the way artists within different music genres collaborate. More specifically, we were interested in examining whether collaboration patterns differ between the two highly collaborative genres: pop and rap. We hypothesize that these patterns do indeed vary, and we aim to explore what this reveals about the culture within each genre. We believe that network science is the perfect tool to help us answer this question. Moreover, we want to compare linguistic styles and themes within each genre.
In this project, we constructed our own dataset by utilizing two different sources. The first was a large dataset containing approximately 156,000 artists and recorded collaborations between them. We found this dataset ideal for our project, as it included the key features we needed for constructing a network: artist names, follower counts (allowing us to filter out smaller or duplicate artists), and genre labels (enabling us to compare genres within the network).
In addition, we used a smaller dataset containing a list of the 10,000 most listened-to artists within the US. This dataset allowed us to narrow our analytical focus, excluding less influential or smaller artists. We intersected this list with the larger dataset to retain only relevant artists. Various additional preprocessing steps were employed to ensure a clean and usable dataset, which are described further below.
Our datasets did not contain any text data. To address this, we used what we learned in class and collected lyrics for each artist through the Genius API. Our approach was to gather lyrics from each artist's three most popular songs, as we expect these to be reasonably representative of their musical style and identity.
# Import of the most central dependencies
import pandas as pd
import matplotlib.pyplot as plt
import ast
import networkx as nx
import community as community_louvain
from collections import Counter, defaultdict
import netwulf as nw
import numpy as np
import re
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from scipy.stats import chi2
from nltk.tokenize import MWETokenizer
from wordcloud import WordCloud
import math
import pickle
import sys
sys.dont_write_bytecode = True
Basic stats. Let's understand the dataset better
Data cleaning and preprocessing
# Spotify Artist Collaboration Network (Large Dataset)
nodes_df = pd.read_csv('nodes.csv')
edges_df = pd.read_csv('edges.csv')
nodes_df['genres'] = nodes_df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])
nodes_df
| | spotify_id | name | followers | popularity | genres | chart_hits |
|---|---|---|---|---|---|---|
| 0 | 48WvrUGoijadXXCsGocwM4 | Byklubben | 1738.0 | 24 | [nordic house, russelater] | ['no (3)'] |
| 1 | 4lDiJcOJ2GLCK6p9q5BgfK | Kontra K | 1999676.0 | 72 | [christlicher rap, german hip hop] | ['at (44)', 'de (111)', 'lu (22)', 'ch (31)', ... |
| 2 | 652XIvIBNGg3C0KIGEJWit | Maxim | 34596.0 | 36 | [] | ['de (1)'] |
| 3 | 3dXC1YPbnQPsfHPVkm1ipj | Christopher Martin | 249233.0 | 52 | [dancehall, lovers rock, modern reggae, reggae... | ['at (1)', 'de (1)'] |
| 4 | 74terC9ol9zMo8rfzhSOiG | Jakob Hellman | 21193.0 | 39 | [classic swedish pop, norrbotten indie, swedis... | ['se (6)'] |
| ... | ... | ... | ... | ... | ... | ... |
| 156417 | 2ces6d2YsQP1RpGMYpdFy8 | David Urwitz | 5470.0 | 29 | [classic swedish pop] | NaN |
| 156418 | 6AeznZajNbXUulT7W4tK5l | Darmiko | 2022.0 | 23 | [] | NaN |
| 156419 | 3GEijIjrgb4lPe9WtURBzz | Katriell | 268.0 | 0 | [] | NaN |
| 156420 | 0ldQL0icSoMz9OOZcWG8Zt | Yung Fresh | 181.0 | 19 | [] | NaN |
| 156421 | 1QZqarAGs0Lxx495oNcBnZ | Rakshitha Rao | 23.0 | 24 | [] | NaN |
156422 rows × 6 columns
Explaining the attributes
The notebook uses the following attributes: spotify_id, name, followers, popularity, genres and chart_hits. Some of these are self-explanatory, but the following need further clarification:
- Popularity: a value between 0 and 100, a heuristic calculated by Spotify that is primarily based on total stream counts, artist chart positions, and how recent these are.
- Genres: the collection of genres the artist creates music in. Relevant for this project is that it covers many different pop and rap subgenres, such as drill rap and k-pop.
- Chart_hits: a metric showing how high the artist's songs have ranked on the hit lists of different countries.
Reference: https://developer.spotify.com/documentation/web-api/reference/get-an-artist
# US Top 10K Artists:
artists_us_df = pd.read_csv("most_listened_artists_in_US_dataset.csv", index_col=0)
artists_us_df
| Name | ID | Gender | Age | Country | Genres | Popularity | Followers | URI |
|---|---|---|---|---|---|---|---|---|
| Drake | 3TVXtAsR1Inumwj472S9r4 | male | 33 | CA | ['canadian hip hop', 'canadian pop', 'hip hop'... | 95 | 83298497 | spotify:artist:3TVXtAsR1Inumwj472S9r4 |
| Post Malone | 246dkjvS1zLTtiykXe5h60 | male | 25 | US | ['dfw rap', 'melodic rap', 'pop', 'rap'] | 86 | 43130108 | spotify:artist:246dkjvS1zLTtiykXe5h60 |
| Ed Sheeran | 6eUKZXaKkcviH0Ku9w2n3V | male | 29 | GB | ['pop', 'singer-songwriter pop', 'uk pop'] | 87 | 115998928 | spotify:artist:6eUKZXaKkcviH0Ku9w2n3V |
| J Balvin | 1vyhD5VmyZ7KMfW5gqLgo5 | male | 35 | CO | ['reggaeton', 'reggaeton colombiano', 'trap la... | 83 | 38028010 | spotify:artist:1vyhD5VmyZ7KMfW5gqLgo5 |
| Bad Bunny | 4q3ewBCX7sLwd24euuV69X | male | 26 | PR | ['reggaeton', 'trap latino', 'urbano latino'] | 95 | 77931484 | spotify:artist:4q3ewBCX7sLwd24euuV69X |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| IVAN$ITO | 0cPmSFfjpop7imPVWSk2yc | NaN | 0 | NaN | [] | 20 | 4249 | spotify:artist:0cPmSFfjpop7imPVWSk2yc |
| Antonello Venditti | 3hYLJPJuDyblFKersEaFd6 | male | 71 | IT | ["canzone d'autore", 'classic italian pop', 'i... | 54 | 778642 | spotify:artist:3hYLJPJuDyblFKersEaFd6 |
| Lea Salonga | 1GlMjIezcLwV3OFlX0uXOv | female | 49 | PH | ['movie tunes', 'tagalog worship'] | 56 | 179832 | spotify:artist:1GlMjIezcLwV3OFlX0uXOv |
| Vertical Horizon | 6Hizgjo92FnMp8wGaRUNTn | mixed | 29 | NaN | ['neo mellow', 'pop rock', 'post-grunge'] | 48 | 431277 | spotify:artist:6Hizgjo92FnMp8wGaRUNTn |
| Lakko | 48wDYb8J9emrtnfRJvYEdZ | male | 0 | ES | [] | 21 | 21571 | spotify:artist:48wDYb8J9emrtnfRJvYEdZ |
9488 rows × 8 columns
artists_us_name_list = (artists_us_df.index).tolist() # Convert the index of artist names into a list so we can intersect it with the large dataset next
print(artists_us_name_list[:5]) # Sanity check
['Drake', 'Post Malone', 'Ed Sheeran', 'J Balvin', 'Bad Bunny']
We will now construct our dataset as the intersection between the large Spotify collaboration dataset and artists_us_name_list.
nodes_df = nodes_df[nodes_df["name"].isin(artists_us_name_list)]
nodes_df
| | spotify_id | name | followers | popularity | genres | chart_hits |
|---|---|---|---|---|---|---|
| 0 | 48WvrUGoijadXXCsGocwM4 | Byklubben | 1738.0 | 24 | [nordic house, russelater] | ['no (3)'] |
| 1 | 4lDiJcOJ2GLCK6p9q5BgfK | Kontra K | 1999676.0 | 72 | [christlicher rap, german hip hop] | ['at (44)', 'de (111)', 'lu (22)', 'ch (31)', ... |
| 15 | 3xs0LEzcPXtgNfMNcHzLIP | Rockwell | 40344.0 | 58 | [] | ['us (1)', 'gb (1)', 'at (1)', 'be (1)', 'ca (... |
| 20 | 2NUz5P42WqkxilbI8ocN76 | Vybz Kartel | 1026598.0 | 63 | [dancehall, jamaican dancehall, reggae fusion] | ['cr (3)', 'pa (1)'] |
| 22 | 4Lm0pUvmisUHMdoky5ch2I | Apocalyptica | 864846.0 | 60 | [alternative metal, bow pop, cello, finnish me... | ['fi (2)'] |
| ... | ... | ... | ... | ... | ... | ... |
| 156038 | 7p5J8SfKU9Rulp7tcA53G8 | Jose Merce | 182186.0 | 51 | [cante flamenco, flamenco, nuevo flamenco, rumba] | NaN |
| 156119 | 3gJ0f9ov2Vjrbo9RnFFH76 | Endor | 267.0 | 2 | [scottish indie folk] | NaN |
| 156199 | 7mKmqnXqn1WoEFljKyvAHR | 2T FLOW | 20.0 | 29 | [] | NaN |
| 156220 | 6kT18gnkVrCz8xJQcrib7L | Bhaskar | 230894.0 | 60 | [brazilian bass, brazilian edm, brazilian house] | NaN |
| 156330 | 0Wkm45quqfx3NepJpXDvwE | Superorganism | 225468.0 | 51 | [art pop] | NaN |
9754 rows × 6 columns
By inspecting the dataframe, we notice there are more artists in the intersection (9,754) than in the dataset containing the most listened artists in the US (9,488). This indicates there are duplicates in the dataset. We will inspect this:
# Find all artists whose name occurs more than once within the intersection
duplicate_names = nodes_df[nodes_df.duplicated(subset="name", keep=False)].sort_values(by="name")
duplicate_names
| | spotify_id | name | followers | popularity | genres | chart_hits |
|---|---|---|---|---|---|---|
| 43592 | 1ItNxpDdetHb2gyS10HKfF | 18 Karat | 1644.0 | 18 | [] | NaN |
| 13282 | 5oWFxbBrbk2Mw86PLUg3OZ | 18 Karat | 292544.0 | 53 | [deep german hip hop, german hip hop, german u... | ['at (16)', 'de (30)', 'lu (3)', 'ch (12)'] |
| 57139 | 09ZUMxxU5pgzUF0FtHeGXG | 19HUNNID | 5.0 | 1 | [] | NaN |
| 59493 | 2Zm4abMQXwcrsM9IWY3AoB | 19HUNNID | 513.0 | 31 | [thai trap] | NaN |
| 156199 | 7mKmqnXqn1WoEFljKyvAHR | 2T FLOW | 20.0 | 29 | [] | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 146823 | 1wEQjEpK6KfE9Et2ZZBfPa | Żabson | 43.0 | 4 | [] | NaN |
| 24780 | 0FtUnl4AjR5eDa1v77WU0b | ปู่จ๋าน ลองไมค์ | 138.0 | 17 | [] | NaN |
| 95174 | 23YWwPEvaO5HLqEMgiUXJa | ปู่จ๋าน ลองไมค์ | 23424.0 | 32 | [] | NaN |
| 6081 | 3dTgjg7lzUGiD3NwcGCK1n | 阿冗 | 44494.0 | 48 | [chinese viral pop, mainland chinese pop] | ['my (1)', 'sg (1)', 'tw (4)'] |
| 8241 | 7sD5pBZNNSDMfiF2BvRem7 | 阿冗 | 671.0 | 31 | [] | ['tw (1)'] |
1739 rows × 6 columns
As evident from our dataset, there are a significant number of duplicate artist names. This is primarily due to the scale of the Spotify dataset, which contains over 156,000 artists. Another contributing factor is that Spotify does not enforce unique artist names—meaning multiple users, including lesser-known or negligible artists, can share the same name.
For example, take the artist name Drake, which belongs to one of the most popular artists on the platform. However, because names on Spotify are not unique, a relatively unknown user could also appear under the name Drake, causing confusion and leading to duplicates in our dataset.
To resolve this, we implement a simple disambiguation method: for each duplicated name, we retain only the artist with the highest number of Spotify followers, assuming this is the most prominent and relevant entry.
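# Sort by followers so the most-followed entry for each name comes first
clean_df = nodes_df.sort_values("followers", ascending=False)
# Deduplicate the sorted frame (not nodes_df, which would discard the sort), then restore the original row order
clean_df = clean_df.drop_duplicates(subset="name", keep="first").sort_index()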
clean_df
| | spotify_id | name | followers | popularity | genres | chart_hits |
|---|---|---|---|---|---|---|
| 0 | 48WvrUGoijadXXCsGocwM4 | Byklubben | 1738.0 | 24 | [nordic house, russelater] | ['no (3)'] |
| 1 | 4lDiJcOJ2GLCK6p9q5BgfK | Kontra K | 1999676.0 | 72 | [christlicher rap, german hip hop] | ['at (44)', 'de (111)', 'lu (22)', 'ch (31)', ... |
| 15 | 3xs0LEzcPXtgNfMNcHzLIP | Rockwell | 40344.0 | 58 | [] | ['us (1)', 'gb (1)', 'at (1)', 'be (1)', 'ca (... |
| 20 | 2NUz5P42WqkxilbI8ocN76 | Vybz Kartel | 1026598.0 | 63 | [dancehall, jamaican dancehall, reggae fusion] | ['cr (3)', 'pa (1)'] |
| 22 | 4Lm0pUvmisUHMdoky5ch2I | Apocalyptica | 864846.0 | 60 | [alternative metal, bow pop, cello, finnish me... | ['fi (2)'] |
| ... | ... | ... | ... | ... | ... | ... |
| 155870 | 3a9qv6NLHnsVxJUtKOMHvD | The Glitch Mob | 538974.0 | 61 | [edm, electro house, glitch, glitch hop, indie... | NaN |
| 156036 | 7dh6G6qILmRpUtZU4ZSD4D | Trobeats | 515.0 | 9 | [] | NaN |
| 156038 | 7p5J8SfKU9Rulp7tcA53G8 | Jose Merce | 182186.0 | 51 | [cante flamenco, flamenco, nuevo flamenco, rumba] | NaN |
| 156220 | 6kT18gnkVrCz8xJQcrib7L | Bhaskar | 230894.0 | 60 | [brazilian bass, brazilian edm, brazilian house] | NaN |
| 156330 | 0Wkm45quqfx3NepJpXDvwE | Superorganism | 225468.0 | 51 | [art pop] | NaN |
8756 rows × 6 columns
As we can see, the intersection now contains only 8,756 rows (down from 9,754), meaning we removed 998 duplicate entries, keeping the entry with the most followers for each name. We have now reached our dataset clean_df, which consists only of the most-listened-to artists, excluding duplicates. This serves as the base that will be partitioned into pop_df and rap_df.
Each artist is mapped to the genre they are most active in. The function counts the occurrences of "pop" and "rap" in the artist's genres property and assigns the artist to whichever occurs more often. This prevents an artist from being present in both dataframes.
from utilities.Network_construction_functions import split_artists_by_primary_genre
pop_df, rap_df = split_artists_by_primary_genre(clean_df, ["pop", "rap"])
print(f'There are {len(pop_df)} artists in the constructed pop network')
print(f'There are {len(rap_df)} artists in the constructed rap network')
There are 4161 artists in the constructed pop network
There are 1149 artists in the constructed rap network
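The majority-vote rule described above can be sketched as follows. This is an illustration only, not the code in /utilities: the helper name primary_genre, the word-level matching (so that "trap latino" does not count as "rap") and the choice to leave ties unassigned are all assumptions.

```python
import re

def primary_genre(genres, candidates=("pop", "rap")):
    """Return the candidate keyword matched by the most genre labels, or None.

    Hypothetical helper: matching is done on whole words (an assumption),
    and artists with no match or a tie are left unassigned (also an assumption).
    """
    counts = {c: 0 for c in candidates}
    for label in genres:
        # Split a label such as "k-pop" or "melodic rap" into lowercase words
        words = re.findall(r"[a-z]+", label.lower())
        for c in candidates:
            if c in words:
                counts[c] += 1
    best = max(counts, key=counts.get)
    top = counts[best]
    # No match at all, or a tie for first place: leave the artist unassigned
    if top == 0 or list(counts.values()).count(top) > 1:
        return None
    return best

# Genre lists taken from the US top-artists table shown earlier
post_malone = ["dfw rap", "melodic rap", "pop", "rap"]
ed_sheeran = ["pop", "singer-songwriter pop", "uk pop"]
bad_bunny = ["reggaeton", "trap latino", "urbano latino"]

print(primary_genre(post_malone), primary_genre(ed_sheeran), primary_genre(bad_bunny))
```

Under these assumptions, a column of labels could be obtained with something like clean_df["genres"].apply(primary_genre), after which the frame can be split on that column.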
Dataset statistics
In the following section we calculate the key statistics for the two networks:
- Number of nodes
- Number of edges
- Network density
- Number of isolated nodes
- Is connected
- Number of connected components
- Size of largest component
- Average shortest path length
- Average clustering coefficient
- Transitivity
- Top collaborators (determined by largest degree)
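For readers who want to see how the statistics listed above map onto NetworkX calls, here is a minimal sketch on a toy graph. The function name summarize_network is hypothetical and the actual print_network_statistics in /utilities may be organized differently; only standard NetworkX functions are used.

```python
import networkx as nx

def summarize_network(G):
    """Collect the key statistics for an undirected graph (illustrative sketch)."""
    components = list(nx.connected_components(G))
    largest = G.subgraph(max(components, key=len))
    degrees = dict(G.degree())
    return {
        "nodes": G.number_of_nodes(),
        "edges": G.number_of_edges(),
        "density": nx.density(G),
        "isolated_nodes": nx.number_of_isolates(G),
        "is_connected": nx.is_connected(G),
        "components": len(components),
        "largest_component_size": largest.number_of_nodes(),
        # Path lengths are only defined inside a connected (sub)graph
        "avg_shortest_path": nx.average_shortest_path_length(largest),
        "avg_clustering": nx.average_clustering(G),
        "transitivity": nx.transitivity(G),
        # Nodes ordered by degree, highest first
        "top_collaborators": sorted(degrees, key=degrees.get, reverse=True)[:3],
    }

# Toy graph: a triangle (A, B, C), a pendant node D and an isolated node E
G_toy = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
G_toy.add_node("E")
stats = summarize_network(G_toy)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Restricting the average shortest path length to the largest component matters here, because both genre networks are disconnected and the measure is undefined across components.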
from utilities.Network_construction_functions import get_Graph_with_names
G_pop = get_Graph_with_names(pop_df, edges_df, verbose=False)
G_rap = get_Graph_with_names(rap_df, edges_df, verbose=False)
from utilities.Network_statistics import print_network_statistics
print('Pop network statistics')
print_network_statistics(G_pop)
Pop network statistics
Number of nodes: 4161
Number of edges: 13698
Density: 0.0015826909211912818
Number of isolated nodes: 758
Is connected: False
Number of connected components: 801
Size of largest component: 3286
Average shortest path length (largest component): 5.257677282248106
Average clustering coefficient: 0.13024735101867216
Transitivity (global clustering coefficient): 0.1403883617168353
Degree analysis
Average Degree: 6.58
Median Degree: 3.0
Mode Degree: 0
Minimum Degree: 0
Maximum Degree: 162
print('Rap network statistics')
print_network_statistics(G_rap)
Rap network statistics
Number of nodes: 1149
Number of edges: 6860
Density: 0.010401409497123692
Number of isolated nodes: 143
Is connected: False
Number of connected components: 153
Size of largest component: 961
Average shortest path length (largest component): 4.373501994450225
Average clustering coefficient: 0.2818015338203428
Transitivity (global clustering coefficient): 0.32168006480962175
Degree analysis
Average Degree: 11.94
Median Degree: 5.0
Mode Degree: 0
Minimum Degree: 0
Maximum Degree: 104
Examining degree distributions
from utilities.Network_statistics import plot_degree_distribution
from utilities.Network_statistics import plot_degree_distribution_log_log_scale
plot_degree_distribution(G_pop, "POP")
plot_degree_distribution(G_rap, "RAP")
Both genre networks appear to have a heavy-tailed degree distribution: most artists have few collaborators, while a few are very highly connected. We will plot both on a log-log scale as well.
plot_degree_distribution_log_log_scale(G_pop, "POP")
Power-law exponent: 1.913
plot_degree_distribution_log_log_scale(G_rap, "RAP")
Power-law exponent: 1.380
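One common way to obtain such an exponent is to fit a straight line to the degree distribution on log-log axes and read off the negative slope. The sketch below illustrates that idea in plain Python. It is an assumption that the plotting utility works along these lines, and the function name loglog_slope_exponent is hypothetical. Maximum-likelihood estimators are generally considered more reliable than this simple fit.

```python
import math
from collections import Counter

def loglog_slope_exponent(degrees):
    """Estimate gamma in P(k) ~ k^(-gamma) with a least-squares line in log-log space.

    Illustrative sketch only; not necessarily the method used by the notebook's utilities.
    """
    positive = [k for k in degrees if k > 0]  # log(0) is undefined, so isolated nodes are dropped
    counts = Counter(positive)
    n = len(positive)
    xs = [math.log(k) for k in counts]                # log of each observed degree
    ys = [math.log(c / n) for c in counts.values()]   # log of its empirical probability
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return -slope

# Toy degree sequence whose counts fall off exactly as k^-2, plus ten isolated nodes
toy_degrees = [0] * 10 + [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
gamma = loglog_slope_exponent(toy_degrees)
print(f"Estimated exponent: {gamma:.3f}")
```

For one of the real networks, the input would be the degree sequence, for example [d for _, d in G_pop.degree()].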
from utilities.Network_construction_functions import print_top_collaborators
print("POP-network")
print_top_collaborators(G_pop, 5)
POP-network
Top 5 artists with the most collaborations:
1. R3HAB — 162 collaborations
2. David Guetta — 125 collaborations
3. Tiësto — 113 collaborations
4. Steve Aoki — 102 collaborations
5. Diplo — 88 collaborations
print("RAP-network")  # Label corrected: this cell reports the rap network
print_top_collaborators(G_rap, 5)
RAP-network
Top 5 artists with the most collaborations:
1. Gucci Mane — 104 collaborations
2. French Montana — 99 collaborations
3. Future — 98 collaborations
4. Young Thug — 95 collaborations
5. Lil Wayne — 93 collaborations
Discussing the dataset statistics and plots
Our pop network follows a power law relatively well, with an estimated exponent of 1.913, slightly below the 2-3 range typical of scale-free networks. This indicates that the network has a heavy-tailed degree distribution, consistent with many real-world networks where most nodes have few connections while a small number of nodes have many.
In comparison, the rap network has a lower power-law exponent of 1.38, indicating an even more skewed distribution. This suggests that collaboration in rap is even more heavily concentrated, with a small number of central artists driving a large share of the collaborations.
The degree distributions of the pop and rap collaboration networks reveal noteworthy structural differences in how artists collaborate within each genre.
In the pop network, the average degree is 6.58, but the median is only 3 and the mode is 0, suggesting that while a few artists collaborate extensively, the majority are only loosely connected or not connected at all. This is confirmed by the high number of isolated nodes (758 out of 4161) and the fragmented structure with 801 components. The maximum degree of 162 indicates a small number of central pop artists with great influence, acting as hubs in an otherwise sparsely connected network.
In contrast, the rap network is notably more cohesive. Despite being smaller, it has a higher average degree of 11.94 and a median degree of 5, suggesting that rap artists tend to collaborate more frequently and more broadly than pop artists. The mode is still 0, and there are 143 isolated nodes, but this is a smaller fraction of the network than in pop (about 12% versus 18%). The largest component in rap includes 961 out of 1149 nodes, roughly 84% of the network, which demonstrates a high level of connectivity. Additionally, the maximum degree of 104 reflects the presence of central hub artists, as we saw in the pop network.
Overall, the pop network is more fragmented, with a large number of disconnected or loosely connected nodes, while the rap network shows a tighter, more integrated structure in which collaboration is more widespread and balanced across the network.
Top collaborators and an interpretation
The top collaborators in both networks highlight interesting patterns. In pop, the most connected artists—R3HAB, David Guetta, Tiësto, Steve Aoki, and Diplo—are all prominent DJs and producers. This aligns with the structure of pop music, where producers often feature a wide range of vocal artists, leading to high collaboration counts.
In contrast, the rap network's top collaborators (Gucci Mane, French Montana, Future, Young Thug, and Lil Wayne) are all vocal performers. This suggests that in rap, collaboration primarily occurs between performing artists themselves, which is consistent with the genre's strong emphasis on features.
Tools, theory and analysis
In this section we will cover the following:
- The Louvain algorithm for discovering communities
- Visualizing each of the networks with their communities using Netwulf
- Zooming in on North American communities
- Network metrics (modularity and assortativity)
- Text analysis (word clouds)
Louvain algorithm
We use the Louvain algorithm to detect communities within the RAP and POP collaboration networks. It works by maximizing modularity, a measure of how densely connected nodes are within communities compared to between them.
Louvain iteratively groups nodes that increase modularity, then compresses communities into single nodes and repeats the process. This makes it both fast and effective, especially for large networks like ours.
It's well-suited for identifying collaborative clusters - such as crews, label groups, or stylistic circles - without needing any prior labels or assumptions.
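To make the notion of modularity concrete, the sketch below computes it by hand for a toy graph, using the standard formula for unweighted, undirected graphs: Q is the sum over communities of L_c/m - (d_c/2m)^2, where m is the total number of edges, L_c the number of edges inside community c and d_c the sum of degrees in c. The function and variable names are illustrative and are not part of the project code.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph without self-loops."""
    m = len(edges)
    internal = defaultdict(int)    # L_c: edges with both endpoints in community c
    degree_sum = defaultdict(int)  # d_c: sum of the degrees of the nodes in community c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}  # one community per triangle
single = {node: "a" for node in range(6)}                 # everything in one community

q_split = modularity(toy_edges, split)
q_single = modularity(toy_edges, single)
print(f"Two communities: {q_split:.3f}, one community: {q_single:.3f}")
```

Splitting along the bridge scores higher than lumping all nodes together, which is the kind of improvement Louvain searches for when it moves nodes between communities.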
pop_communities = community_louvain.best_partition(G_pop, random_state=20)
community_list = []
nx.set_node_attributes(G_pop, pop_communities, 'community')
community_sizes = Counter(pop_communities.values())
sorted_communities = sorted(community_sizes.items(), key=lambda x: x[1], reverse=True)
for community_id, size in sorted_communities:
    community_list.append(f"Community {community_id}: {size} nodes")
print(community_list[:10])
['Community 1: 953 nodes', 'Community 6: 388 nodes', 'Community 9: 333 nodes', 'Community 2: 172 nodes', 'Community 16: 156 nodes', 'Community 0: 152 nodes', 'Community 18: 144 nodes', 'Community 25: 139 nodes', 'Community 53: 135 nodes', 'Community 13: 134 nodes']
rap_communities = community_louvain.best_partition(G_rap, random_state=20)
community_list = []
nx.set_node_attributes(G_rap, rap_communities, 'community')
community_sizes = Counter(rap_communities.values())
sorted_communities = sorted(community_sizes.items(), key=lambda x: x[1], reverse=True)
for community_id, size in sorted_communities:
    community_list.append(f"Community {community_id}: {size} nodes")
print(community_list[:10])
['Community 1: 349 nodes', 'Community 2: 191 nodes', 'Community 0: 88 nodes', 'Community 8: 78 nodes', 'Community 4: 65 nodes', 'Community 30: 51 nodes', 'Community 3: 46 nodes', 'Community 18: 27 nodes', 'Community 60: 26 nodes', 'Community 32: 15 nodes']
We will now investigate which countries the artists in the different communities come from.
from utilities.top_community_countries import print_top_community_country_distribution
print("Top 5 POP Communities by Size and Their Country Composition:")
print_top_community_country_distribution(G_pop, artists_us_df)
Top 5 POP Communities by Size and Their Country Composition:
Community 1 (953 artists): US (28%), nan (22%), GB (15%), SE (5%), AU (4%), NL (3%), CA (3%), DE (3%), FR (2%), NO (2%), BE (1%), JP (1%), KR (1%), NZ (1%), IT (1%), IE (1%), DK (1%), JM (1%), ES (0%), RU (0%), EE (0%), PL (0%), BR (0%), IL (0%), CN (0%), PR (0%), LT (0%), RO (0%), IS (0%), CH (0%), MX (0%), FI (0%), VG (0%), AT (0%), GH (0%), TR (0%), AR (0%), GR (0%), AG (0%), XK (0%), CL (0%), CO (0%), TW (0%), NG (0%), MY (0%), ZA (0%), SN (0%), ID (0%), BA (0%), MA (0%), BF (0%), SI (0%)
Community 6 (388 artists): ES (22%), nan (15%), CO (11%), MX (10%), US (8%), DO (4%), PR (4%), AR (3%), VE (3%), PE (3%), IT (2%), CA (2%), PA (2%), CL (2%), GB (2%), FR (2%), BR (1%), JP (1%), DE (1%), KR (1%), UY (1%), CU (1%), GT (1%), NL (0%), HU (0%), AT (0%), HN (0%), TR (0%), NI (0%), GR (0%), CZ (0%), NO (0%), PL (0%), SE (0%), ZA (0%)
Community 9 (333 artists): SE (35%), nan (25%), DK (18%), NO (13%), US (3%), GB (2%), DE (2%), IT (1%), FI (0%), PL (0%), FR (0%), NZ (0%), IS (0%), AU (0%), BE (0%)
Community 2 (172 artists): TW (32%), HK (20%), nan (15%), US (9%), JP (8%), CN (5%), SG (3%), MY (2%), KR (2%), GB (1%), CA (1%), PH (1%), NZ (1%), AU (1%), IT (1%), FI (1%), SE (1%)
Community 16 (156 artists): nan (44%), NL (41%), US (3%), BE (3%), DE (2%), XE (2%), GB (1%), FI (1%), SE (1%), TN (1%), NO (1%), IT (1%), GH (1%), PR (1%), SR (1%)
print("Top 5 Rap Communities by Size and Their Country Composition:")
print_top_community_country_distribution(G_rap, artists_us_df)
Top 5 Rap Communities by Size and Their Country Composition:
Community 1 (349 artists): US (54%), nan (38%), CA (2%), GB (1%), JP (1%), NL (1%), JM (1%), IT (1%), AT (0%), SE (0%), FR (0%), ZA (0%), AR (0%), SK (0%), KR (0%), CM (0%), PR (0%), AU (0%)
Community 2 (191 artists): PR (35%), nan (28%), AR (7%), DO (5%), ES (5%), US (4%), CO (3%), CL (2%), GB (2%), MX (2%), CH (1%), BR (1%), DE (1%), JM (1%), RU (1%), PA (1%), LV (1%), CA (1%), BE (1%), VE (1%), NL (1%)
Community 0 (88 artists): DE (38%), nan (36%), TR (20%), US (2%), FR (1%), NL (1%), JP (1%)
Community 8 (78 artists): nan (46%), PL (36%), DE (5%), NL (4%), CZ (3%), US (3%), JP (1%), SK (1%), HU (1%)
Community 4 (65 artists): nan (42%), IT (40%), US (6%), GB (6%), CA (2%), JP (2%), FR (2%), PR (2%)
In the pop network, the largest community is dominated by U.S. artists, followed by a Spanish-speaking community, then a Scandinavian one, an East Asian community, and finally a Dutch one.
In the rap network, the largest community is again centered around U.S. artists, followed by a Spanish-speaking group led by Puerto Rican artists, then a German and Turkish community, while the fourth consists mainly of Polish artists and the fifth primarily of Italian artists. Note that in every community a sizeable share of artists has no recorded country (nan), so these percentages should be read as approximate.
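The country breakdown above boils down to grouping artists by community and counting countries within each group. A minimal sketch of that idea follows. The function name community_country_shares and the toy data are hypothetical, and treating a missing country as the label "nan" is an assumption that mirrors the output shown above, not a description of the utility's internals.

```python
from collections import Counter, defaultdict

def community_country_shares(partition, country_of):
    """Return {community: [(country, rounded percentage), ...]}, most common country first.

    partition maps artist -> community id; country_of maps artist -> country code.
    Artists without a known country are counted under "nan" (an assumption).
    """
    members = defaultdict(list)
    for artist, community in partition.items():
        members[community].append(country_of.get(artist, "nan"))
    shares = {}
    for community, countries in members.items():
        total = len(countries)
        shares[community] = [
            (country, round(100 * n / total))
            for country, n in Counter(countries).most_common()
        ]
    return shares

# Toy data: four artists in community 0 (one without a country), one in community 1
toy_partition = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1}
toy_countries = {"a": "US", "b": "US", "c": "GB", "e": "SE"}

shares = community_country_shares(toy_partition, toy_countries)
for community, breakdown in shares.items():
    print(community, ", ".join(f"{c} ({p}%)" for c, p in breakdown))
```

Applied to the real networks, the partition would be the Louvain result (for example pop_communities) and the country lookup would come from the Country column of artists_us_df.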
Visualization of the POP-network
Using the communities found by the Louvain algorithm, we plot the two networks color-coded by community.
# Visualize pop network
run_netwulf = False # Convenience flag allowing the notebook to be run from top to bottom. If False, Netwulf is not run and a saved image is shown instead.
# Change to True to run Netwulf
if run_netwulf:
    from utilities.Netwulf_plot_functions import netwulf_plot_communities
    from community import community_louvain
    communities = pop_communities
    colors = ['#e57468', '#68e574', '#7468e5', '#e5d068', '#68d0e5']
    netwulf_plot_communities(G_pop, communities, color_palette=colors, path="Pop_network.pdf", zoom=0.76)
else:
    # Show the previously exported image of the pop network
    from IPython.display import Image, display
    display(Image('Pop_network.png'))